Clustering Validation Scores

Clustering is an unsupervised learning technique used in fields such as text mining, image analysis, market segmentation, and bioinformatics. Its goal is to group data points so that objects in the same group are more similar to each other than to objects in other groups.

Since clustering lacks predefined labels, validating the quality of the resulting clusters is essential.

Clustering validation, the process of evaluating the quality and effectiveness of clustering results in unsupervised learning, has long been recognized as vital to the success of clustering applications. It helps determine how well the data has been grouped into clusters and whether the clustering makes sense for the problem.

Clustering validation measures can be categorized into three main types: internal clustering validation, external clustering validation, and relative clustering validation. The main difference is whether or not external information is used for clustering validation.

  1. Internal Clustering Validation

Internal validation evaluates clustering quality using only the data and the cluster assignments; no ground-truth labels are needed. Because internal measures rely solely on information in the data, they can be used both to choose the best clustering algorithm and to select the optimal number of clusters without any additional information. In practice, external information such as class labels is often unavailable, and in such scenarios internal validation measures are the only option for cluster validation.

As the goal of clustering is to make objects within the same cluster similar and objects in different clusters distinct, internal validation measures are often based on the following two criteria.

I. Compactness: Intra-Cluster Similarity

It measures how closely related the objects in a cluster are. One group of measures evaluates cluster compactness based on variance: lower variance indicates better compactness. Numerous other measures estimate compactness based on distance, such as maximum or average pairwise distance, and maximum or average center-based distance.

One method to measure compactness is to determine the centroids of the different clusters and report the sum of squared (SSQ) distances to them as the corresponding objective function. Smaller values of this measure indicate better cluster quality. This measure is clearly better suited to distance-based algorithms, such as k-means, than to density-based methods, such as DBSCAN. Another problem with SSQ is that the absolute distances give the user no meaningful sense of the quality of the underlying clusters.

Other than SSQ, we can use the average distance between all points and the cluster center, the diameter of a cluster (the maximum distance between any two points within it), the mean pairwise distance, and the variance within clusters.
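As a concrete illustration of the SSQ objective, here is a minimal from-scratch sketch in Python (the function name `ssq` and the toy data are illustrative, not from any particular library): each cluster gets its centroid, and we sum the squared point-to-centroid distances.

```python
def ssq(points, labels):
    """Sum of squared distances from each point to its cluster's centroid.

    points: list of equal-length coordinate tuples; labels: cluster id per point.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster's points.
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members
        )
    return total

# Tighter clusters give a smaller SSQ, i.e. better compactness.
loose = ssq([(0.0,), (4.0,)], [0, 0])   # centroid 2.0 -> 4 + 4 = 8
tight = ssq([(0.0,), (1.0,)], [0, 0])   # centroid 0.5 -> 0.25 + 0.25 = 0.5
```

Note that, as the text observes, these absolute values only become meaningful when comparing clusterings of the same data.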

II. Separation: Inter-Cluster Dissimilarity

It measures how distinct or well-separated a cluster is from other clusters. For example, the pairwise distances between cluster centers, or the pairwise minimum distances between objects in different clusters, are widely used as measures of separation. Some indices also use measures based on density.

Example Metrics for Internal Validation:

1.1 Silhouette Score

It was proposed by the Belgian statistician Peter Rousseeuw in 1987. The Silhouette Score is computed for each data point individually, based on two quantities: the point's average distance to the other members of its own cluster, and its average distance to the members of the nearest other cluster.

We can then compute the average silhouette score over all the objects (data points).

The silhouette ranges from −1 to +1, where a high value indicates that the object is well-matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable" and over 0.25 "weak".

Assume the dataset has been clustered into k clusters. The silhouette score for a data point i ∈ C_I (a data point in the cluster C_I) is calculated as follows:

  1. The average intra-cluster distance a(i) is calculated with the formula:

     a(i) = (1 / (|C_I| − 1)) · Σ_{j ∈ C_I, j ≠ i} d(i, j)

  2. The nearest-cluster distance b(i) is calculated with the formula:

     b(i) = min_{J ≠ I} (1 / |C_J|) · Σ_{j ∈ C_J} d(i, j)

  3. We now define the silhouette (value) of one data point i as follows:

     s(i) = (b(i) − a(i)) / max{a(i), b(i)} if |C_I| > 1, and s(i) = 0 if |C_I| = 1.

NOTE: The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.

Thus the mean s(i) over the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k is used in the clustering algorithm, some of the clusters will typically display much narrower silhouettes (i.e., most of their points have silhouette scores close to 0) than the rest. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster-specific.
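The computation of a(i), b(i), and s(i) described above can be sketched directly in Python. This is a from-scratch illustration with my own function names, not an optimized implementation; libraries such as scikit-learn provide equivalent functionality.

```python
def silhouette_samples(points, labels):
    """Per-point silhouette values s(i), assuming at least two clusters."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                     # singleton cluster: s(i) = 0 by convention
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of i's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Two tight, well-separated 1-D clusters: the mean silhouette is close to 1.
pts = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
labs = [0, 0, 0, 1, 1, 1]
mean_s = sum(silhouette_samples(pts, labs)) / len(pts)
```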

1.1.1 Silhouette Coefficient

To determine the natural number of clusters within a dataset, we can use the silhouette coefficient:

SC = max_k s̄(k)

where s̄(k) is the mean s(i) over all data of the entire dataset for a specific number of clusters k.

1.1.2 Drawbacks

Computing the silhouette requires all O(N²) pairwise distances, which becomes expensive for large datasets. The score also tends to favor compact, roughly convex clusters, so it can be misleading for density-based or irregularly shaped clusters.

1.1.3 Simplified Silhouette

To save costly computation, we can replace a(i) and b(i) (which average O(N²) pairwise distances) with a′(i) and b′(i), defined as follows:

a′(i) = d(i, c_I), the distance from point i to the centroid of its own cluster, and

b′(i) = min_{J ≠ I} d(i, c_J), the distance from point i to the nearest other cluster's centroid.

The number of distances used in simplified silhouette computation is O(Nk), where k is the number of clusters, and the added benefit is that a’(i) is always defined.

1.2 Davies-Bouldin Index (DBI)

Introduced in 1979, this internal cluster validation metric is also based on compactness and separation. It is calculated as the average similarity of each cluster with the cluster most similar to it. In this context, similarity is defined as the ratio between intra-cluster dispersion and inter-cluster distance. As such, this index scores well-separated clusters with low dispersion as better.

1.2.1 Some formulae

Let Ci be a cluster of data points. Let Xj be a data point assigned to cluster Ci.

For cluster Ci, let its centroid be Ai, and Ti be the number of points in the cluster.

Then the scatter within the cluster, S_i, is defined as the average distance between the data points in cluster i and the centroid of the cluster. The formula for S_i is:

S_i = ( (1 / T_i) · Σ_{j=1}^{T_i} ||X_j − A_i||^p )^{1/p}

Usually, the value of p is 2, which makes the distance a Euclidean-distance function. Many other distance metrics can be used in the case of manifolds and higher-dimensional data, where the Euclidean distance may not be the best measure for determining the clusters.

NOTE: The clustering algorithm that we use typically has a distance function, usually the Euclidean distance. But many algorithms let us change this distance function; in that case, it is important to match the distance metric used in the formula for S_i with the one used in our clustering algorithm.

The centroids of two clusters Ci and Cj are Ai and Aj respectively. If the dataset has n-dimensional points, let a_{k,i} be the kth element of A_i; similarly, the kth element of A_j is a_{k,j}. Then the separation between the clusters i and j is given as:

M_{i,j} = ||A_i − A_j||_p = ( Σ_{k=1}^{n} |a_{k,i} − a_{k,j}|^p )^{1/p}

Typically the value of p is 2.

1.2.2 Forming the metric

Let R_{i,j} be a measure of how good the clustering scheme is with respect to clusters i and j. For the clustering to be good, the separation M_{i,j} between the two different clusters should be as large as possible, and the within-cluster scatter S_i should be as low as possible. R_{i,j} is therefore defined in terms of S_i, S_j, and M_{i,j} so that it is non-negative and symmetric, and so that it increases when scatter grows or separation shrinks. In particular, if we keep cluster i as it is and change cluster j to a cluster k that is at the same separation from cluster i as cluster j was, but whose internal scatter is smaller, then R_{i,k} should be smaller than R_{i,j}, since the clustering scheme is better for i and k than for i and j.

A solution that satisfies these properties is:

R_{i,j} = (S_i + S_j) / M_{i,j}

Then for a cluster i, D_i is defined as the maximum value of R_{i,j} over all other clusters j ≠ i:

D_i = max_{j ≠ i} R_{i,j}

Now we take the average of the D_i's over all the clusters. If N is the number of clusters,

DB = (1 / N) · Σ_{i=1}^{N} D_i

DB is called the Davies–Bouldin index.

Lower index values indicate a better clustering result.
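Putting the pieces together (S_i with p = 2, M_{i,j}, R_{i,j}, and D_i), a minimal from-scratch sketch of the index might look like this (function and variable names are my own):

```python
def davies_bouldin(points, labels):
    """Davies-Bouldin index with p = 2 (Euclidean distances)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # A_i: centroids; S_i: root-mean-square distance of members to the centroid.
    centroids, scatters = {}, {}
    for lab, members in clusters.items():
        c = tuple(sum(x) / len(members) for x in zip(*members))
        centroids[lab] = c
        scatters[lab] = (sum(dist(p, c) ** 2 for p in members) / len(members)) ** 0.5

    labs = list(clusters)
    total = 0.0
    for i in labs:
        # D_i = max over j != i of R_ij = (S_i + S_j) / M_ij
        total += max(
            (scatters[i] + scatters[j]) / dist(centroids[i], centroids[j])
            for j in labs
            if j != i
        )
    return total / len(labs)    # DB = average of the D_i

# Compact, well-separated clusters should give a value close to 0.
pts = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
db = davies_bouldin(pts, [0, 0, 0, 1, 1, 1])
```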

1.2.3 Intuition

These conditions constrain the index to be symmetric and non-negative. Because it is defined as a function of the ratio of within-cluster scatter to between-cluster separation, a lower value means a better clustering. The index is the similarity between each cluster and its most similar one, averaged over all the clusters; since ideally no cluster should resemble another, the best clustering scheme is the one that minimizes the Davies–Bouldin index. Because the index is an average over all clusters, a good way to decide how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over: the number of clusters for which the value is lowest is a good estimate of the number of clusters into which the data could ideally be classified.

1.2.4 Drawback and Soft DBI

DBI provides a measure of how good the clustering scheme is, but it does not guarantee the clusters are useful, meaningful, or optimal for retrieval.

The Davies–Bouldin Index has been extended to support soft clustering, using a fuzzy logic framework. This version incorporates membership values from algorithms like fuzzy c-means, where each element can belong to multiple clusters to varying degrees. As a result, the index evaluates not just compactness and separation, but also the degree of membership, making it more suitable for fuzzy clustering scenarios.

  2. External Clustering Validation

External validation measures use external information not present in the data. Since external validation measures know the "true" cluster number in advance, they are mainly used for choosing an optimal clustering algorithm on a specific data set, whereas internal validation measures can be used to choose both the best clustering algorithm and the optimal cluster number without any additional information. An example of an external validation measure is entropy, which evaluates the "purity" of clusters based on the given class labels.

NOTE: In practice, external information such as class labels is often not available in many application scenarios. However, when synthetic data is generated from known benchmarks, it is possible to associate cluster identifiers with the generated records.

In the context of real data sets, these goals can be approximately achieved with the use of class labels when they are available. The major risk with the use of class labels is that these labels are based on application-specific properties of that data set and may not reflect the natural clusters in the underlying data. Nevertheless, such criteria are still preferable to internal methods because they can usually avoid consistent bias in evaluations when used over multiple data sets.

Example Metrics for External Validation:

2.1 Rand Index and Adjusted Rand Index (ARI)

The Rand Index (named after William M. Rand) is a measure of the similarity between two data clusterings. It is used to compare a predicted clustering (e.g., from a clustering algorithm like K-Means) to a ground truth classification (if available), or to another clustering.

One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm.

2.1.1 Definitions and Intuition

A clustering is essentially a partition of the original dataset X = {o1, o2, …, on}; hence a cluster is just a subset of X.

Let X and Y be the clusterings we need to compare. One of these is decided by our clustering algorithm and the other is typically a ground truth labeling or reference clustering used for evaluation.

Suppose X = {X1, X2, …, Xs} (s subsets)

and Y = {Y1, Y2, …, Yr} (r subsets).

Define the following quantities over all pairs of points:

a: the number of pairs that are in the same subset in X and in the same subset in Y

b: the number of pairs that are in different subsets in X and in different subsets in Y

c: the number of pairs that are in the same subset in X but in different subsets in Y

d: the number of pairs that are in different subsets in X but in the same subset in Y

The Rand Index measures the proportion of pairwise agreements relative to all possible pairs.

The number of possible agreements between the two clusterings is clearly a+b, and the number of all possible pairs is a+b+c+d, also equal to the number of ways to select two points out of the dataset.

The Rand index, R, is:

R = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.
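The pair-counting definition translates directly into code. This is a deliberately brute-force O(n²) sketch for illustration (names are my own):

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """R = (a + b) / (a + b + c + d), counted over all point pairs."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1          # together in both clusterings (agreement)
        elif not same_x and not same_y:
            b += 1          # apart in both clusterings (agreement)
        elif same_x:
            c += 1          # together in X, apart in Y (disagreement)
        else:
            d += 1          # apart in X, together in Y (disagreement)
    return (a + b) / (a + b + c + d)

# The index ignores label names: these two clusterings are identical partitions.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
mixed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```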

2.1.2 Contingency Table

The contingency table is an s × r matrix, where each entry n_ij represents the number of elements that are common to the clusters X_i and Y_j:

n_ij = |X_i ∩ Y_j|

Hence every element of the dataset is counted exactly once, so Σ_i Σ_j n_ij = n.

Let us define a few terms corresponding to the contingency matrix: the row sums a_i = Σ_j n_ij (the size of cluster X_i) and the column sums b_j = Σ_i n_ij (the size of cluster Y_j).

2.1.3 Adjusted Rand Index(ARI)

The Adjusted Rand Index using the Permutation Model is:

ARI = ( Σ_ij C(n_ij, 2) − [Σ_i C(a_i, 2) · Σ_j C(b_j, 2)] / C(n, 2) ) / ( ½ [Σ_i C(a_i, 2) + Σ_j C(b_j, 2)] − [Σ_i C(a_i, 2) · Σ_j C(b_j, 2)] / C(n, 2) )

where a_i and b_j are the row and column sums of the contingency table and C(m, 2) = m(m − 1)/2.

The Adjusted Rand Index (ARI) corrects the Rand Index (RI) for chance by considering how likely a given level of agreement between two clusterings would be due to random chance (i.e. permutations of labels).

It does this by comparing the observed similarity to what would be expected under a random model.

Traditionally, this correction uses the Permutation Model, in which the number of clusters and the size of each cluster are held fixed, and the cluster labels are assigned to the points uniformly at random.

However, in practice this model often doesn't reflect reality, especially in methods like K-means, where the cluster sizes are not fixed in advance but are determined by the data.

Because of these limitations, newer versions of ARI have been developed that use different random models to better match real clustering behavior.

Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand Index can yield negative values if the index is less than the expected index.
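The contingency-table form of the ARI can be sketched with Python's `math.comb`; the guard for the degenerate case where both partitions are trivial is my own convention, not part of the formula:

```python
from math import comb

def adjusted_rand_index(labels_x, labels_y):
    n = len(labels_x)
    # Contingency counts n_ij, plus row sums a_i and column sums b_j.
    table = {}
    for i, j in zip(labels_x, labels_y):
        table[(i, j)] = table.get((i, j), 0) + 1
    rows, cols = {}, {}
    for (i, j), nij in table.items():
        rows[i] = rows.get(i, 0) + nij
        cols[j] = cols.get(j, 0) + nij

    index = sum(comb(v, 2) for v in table.values())
    sum_a = sum(comb(v, 2) for v in rows.values())
    sum_b = sum(comb(v, 2) for v in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:       # both partitions trivial: define ARI = 1
        return 1.0
    return (index - expected) / (max_index - expected)

good = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # identical partitions
bad = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])    # worse than chance
```

The second call illustrates the point above: unlike the plain Rand Index, the ARI goes negative when agreement is below the chance expectation.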

2.2 Normalized Mutual Information (NMI)

2.2.1 Information Content of an Event

The information content, also called the surprisal or self-information, of an event E is a function that increases as the probability p(E) of the event decreases. When p(E) is close to 1, the surprisal of the event is low, but if p(E) is close to 0, the surprisal of the event is high. This relationship is described by the function:

I(E) = −log(p(E)) = log(1 / p(E))

2.2.2 Entropy of a Discrete Variable

The entropy H of a discrete random variable X, which takes values in the set 𝒳 and is distributed according to p : 𝒳 → [0, 1] such that p(x) := P[X = x], is:

H(X) = −Σ_{x ∈ 𝒳} p(x) log_b p(x) = E[I(X)]

In the case of p(x) = 0 for some x ∈ 𝒳, the value of the corresponding summand 0 · log_b(0) is taken to be 0, which is consistent with the limit:

lim_{p → 0⁺} p log(p) = 0

Here E is the expected value operator, and I(X) is the information content of X; I(X) is itself a random variable.
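A short sketch of H(X) estimated from a sample of labels, using the base-2 logarithm so the result is in bits (the helper name is mine):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """H = -sum p(x) * log2 p(x). Zero-probability outcomes never appear
    in the Counter, so the 0 * log(0) = 0 convention is handled implicitly."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

uniform = entropy([0, 1, 0, 1])    # two equally likely outcomes: 1 bit
constant = entropy([0, 0, 0, 0])   # no uncertainty: 0 bits
```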

2.2.3 Conditional Entropy

The conditional entropy of a random variable X given a random variable Y, taking values from the sets 𝒳 and 𝒴 respectively, is defined as:

H(X | Y) = −Σ_{x ∈ 𝒳, y ∈ 𝒴} p(x, y) log( p(x, y) / p(y) )

This quantity should be understood as the randomness remaining in the random variable X once the random variable Y is known.

2.2.4 What is Mutual Information (MI)?

MI quantifies the "amount of information" obtained about one random variable by observing the other random variable.

Let (X, Y) be a pair of discrete random variables with values from the sets 𝒳 and 𝒴. If their joint distribution is P(X,Y) and the marginal distributions are P_X and P_Y, the mutual information of the two jointly discrete random variables is calculated as a double sum:

I(X; Y) = Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} P(X,Y)(x, y) log( P(X,Y)(x, y) / (P_X(x) · P_Y(y)) )

where P(X,Y) is the joint probability mass function of X and Y, and P_X and P_Y are the marginal probability mass functions of X and Y respectively.

2.2.5 Normalized Mutual Information (NMI)

Normalized Mutual Information is then calculated by normalizing the Mutual Information; a common choice is the arithmetic mean of the two entropies:

NMI(C, T) = 2 · I(C; T) / ( H(C) + H(T) )

Other normalizations, such as the geometric mean √(H(C) · H(T)) or max(H(C), H(T)), are also used.

Let the clustering labels be the random variable C and the ground-truth labels T. Then:

🔹 Scenario 1: Independence. If C and T are independent, then I(C; T) = 0, so NMI = 0: the clustering carries no information about the true labels.

🔹 Scenario 2: Perfect Correlation. If C and T determine each other exactly, then I(C; T) = H(C) = H(T), so NMI = 1.

🔹 Scenario 3: Noisy Channel Analogy. The clustering can be viewed as the output of a noisy channel whose input is the true labeling; the noisier the channel, the less information C carries about T, and NMI falls between 0 and 1.
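The whole chain (entropies, mutual information, and an arithmetic-mean normalization) can be sketched from scratch as follows; the function names are mine, and the independence and perfect-correlation scenarios fall out as 0 and 1:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_information(c, t):
    """I(C; T) estimated from paired label samples, in bits."""
    n = len(c)
    joint = Counter(zip(c, t))          # joint counts n_ij
    pc, pt = Counter(c), Counter(t)     # marginal counts
    return sum(
        (nij / n) * log2((nij / n) / ((pc[i] / n) * (pt[j] / n)))
        for (i, j), nij in joint.items()
    )

def nmi(c, t):
    """NMI normalized by the arithmetic mean of the two entropies."""
    hc, ht = entropy(c), entropy(t)
    if hc == 0 and ht == 0:             # both labelings constant
        return 1.0
    return 2 * mutual_information(c, t) / (hc + ht)

independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])   # Scenario 1
perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])       # Scenario 2
```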
